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ABSTRACT 

Effect sizes of many common single nucleotide poly- 
morphisms identified in genome-wide association 
studies generally explain only a modest fraction of 
the total estimated heritability in a variety of traits. 
One hypothesis is that rare variants with larger 
effects might account for the missing heritability. 
Despite advances in sequencing technology, dis- 
covering rare variants in a large population is still 
economically challenging. Sequencing pooled 
samples can reduce the cost, but detecting rare 
variants and identifying individual carriers is difficult 
and requires additional experiments. To address 
these issues, we have developed a rare 
variant-detection algorithm V-Sieve to screen for 
rare alleles in pooled DNA samples which, in combin- 
ation with a unique pooling strategy, is able to effi- 
ciently screen a candidate gene for idiosyncratic 
variants in thousands of samples. We applied this 
method to 2283 individuals, and identified >100 poly- 
morphisms in the C-reactive protein locus at an allele 
frequency as low as 0.02%, with a positive predictive 
rate of 93%. We believe this algorithm will be useful 
in both screening for rare variants in genomic regions 
known to associate with particular phenotypes and in 
replicating rare variant associations identified in 
large-scale studies, such as exome re-sequencing 
projects. 

INTRODUCTION 

Genome-wide association studies (GWAS) have been suc- 
cessful in identifying single nucleotide polymorphisms 
(SNPs) associated with a spectrum of phenotypes 
ranging from quantitative traits, such as adult height (1), 



bone mineral density (2), age of menopause and age of 
menarche in women (3-5), to disease states including a 
variety of cancers (6-9). These polymorphisms are typic- 
ally common, with minor allele frequencies >5%. Even 
though dozens of SNPs were found to associate with 
each of these phenotypes, each common variant confers 
a small increment of risk, and in combination, the 
common variants explain a relatively small proportion 
of heritable variation — the portion of phenotypic 
variance in a population attributable to additive genetic 
factors. For discrete traits reported to date, the median 
odds ratio per copy of the risk allele is 1.33, with only 
three SNPs showing odds ratios >3.0 (10). One hypothesis 
for the large unexplained residual phenotypic variance is 
that in addition to the common polymorphisms identified 
by GWAS, rare variants with <5% frequency may also 
play a role in determining phenotypes. Advances in 
sequencing technology have made it possible to sequence 
the entire genome in an individual, but sequencing large 
populations remains impractical owing to the cost per 
sample. Rather than sequence individuals one at a time, 
an alternative approach that can reduce costs is to 
sequence pooled samples containing multiple individuals. 
There are two major challenges in detecting rare variants 
in pooled samples: first, distinguishing experimental noise 
from true polymorphisms with low frequencies in the pool 
is difficult and second, determining the identity of carriers 
of these rare variants within the pooled samples without 
additional experiments is not trivial. 

To address the obstacles in identifying rare variants in a 
pool, we develop an algorithm V-Sieve and demonstrate 
this approach by re-sequencing the C-reactive protein 
(CRP) locus in individuals in the Coronary Artery Risk 
Development in Young Adults (CARDIA) cohort (11). 
V-Sieve leverages the distribution of variant allele fre- 
quency across pools to find frequency spectra consistent 
with rare variants and inconsistent with background noise. 
We condense this into a single statistic and use the 
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predicted null distribution to choose a threshold for 
declaring the presence of a putative polymorphism in a 
pool. By applying V-sieve in serial over several different 
pooling schemes, we can identify individual carriers of 
rare variants in our samples. We verify these putative 
carriers through additional genotyping in all samples. In 
this study, we demonstrate that our approach is both sen- 
sitive and efficient in identifying rare variants in a pooled 
sample, as well as in determining the identity of individual 
carriers for low frequency alleles. 

MATERIALS AND METHODS 

Sample descriptions and preparations for sequencing 
experiments 

The CARDIA cohort is a population-based study initiated 
in 1985 to investigate the evolution of cardiovascular risk 
factors in a large biracial cohort of young adults (11,12). 
We used DNA from 2283 individuals (24 plates of 96 
samples, with 21 wells containing no DNA) from 
CARDIA in these experiments. Samples on these 96-well 
plates were combined into six 384-well plates. For each 
individual sample, we performed long range polymerase 
chain reaction (PCR) to amplify a 6kb genomic region 
surrounding the CRP locus (NCBI36/hgl8 chrl: 
157 945 884-157 951 857; forward primer: ATGTCTGTG 
ATCAGGCACACATTT; reverse primer: GGCATGTC 
CCTGAGATAAGAAATC) (Invitrogen). PCR products 
from 96 selected wells were combined into one pool, 
nebulized to fragment the pooled PCR products into an 
average length of 500 bp and labelled with one barcode 
(Paired-End DNA Sample Prep kit, Multiplexing Sample 
Preparation Oligonucleotide kit, Multiplexing Sequencing 
Primers, Illumina). For the first sequencing experiment, 
each of the six 384-well plates was divided into four 
separate pools based on the position of the wells 
(Supplementary Figure SI). This pooling scheme 
resulted in 24 pools overall, which were sequenced using 
two lanes on the Illumina Genome Analyzer IIx platform. 
In the second experiment, we used a combinatorial 
approach by designing three distinct pooling schemes 
used to pool samples. This approach aimed to assign 
each individual to three separate pools, one in each 
pooling scheme, and thus each individual was present in 
three separate pools and sequenced in three separate lanes 
(Figure 1). This yielded 72 libraries that were sequenced in 
six lanes. 

Read quality assessment 

The first sequencing experiment generated 53 bp forward 
and 53 bp reverse reads. Base quality values for the reads 
were in Illumina FASTQ version 1.5+. Raw reads were 
processed using the MAQ program (13). There were on 
average 5.7 million reads per pool, ranging from 2 to 8.2 
million reads (Supplementary Figure S2). Paired-end 
reads were mapped to the reference CRP sequence with 
the following MAQ parameters: 800 bp maximum allowed 
distance between two paired reads (— a800), maximum 2 
mismatches in the first 24 bp (— n2) and 200 as the 
maximum allowed sum of qualities of mismatches 



(— e200). We kept only pairs with one forward and one 
reverse reads in the analysis. Mapped paired reads with 
other configuration (forward-forward, reverse-reverse, 
reverse-forward) and orphan reads were discarded. On 
average, 95% of forward-reverse paired-end reads in 
each pool were successfully mapped back to the CRP ref- 
erence region (range: 89-97%). The percentage of reads 
with at least one mismatch averaged ~17% per pool 
(range: 16-19%), with an average of two mismatches per 
read per pool (range: 1.8-2.3). The average Phred base 
quality score per mismatch was ~ 14.3 across pools, 
indicating that most mismatches were likely sequencing 
errors. We excluded reads with at least 14 mismatches or 
a sum of mismatches quality scores of >200 from 
analyses, removing an average of 0.18% of reads from 
each pool. Coverage per position in each pool ranged 
from 334X-1328X, with an average base coverage of 
900X. Using bases with Phred quality score of >35, 
coverage along the CRP region is shown in 
Supplementary Figure S3. 

The second sequencing experiment initially generated 3 1 
bp single-end reads. The reads were mapped to the refer- 
ence sequence with the following MAQ parameters: 
number of mismatches in the first 24 bp set to 2 (— n2) 
and maximum allowed sum of qualities of mismatches 
set to 200 (-n200). We filtered out reads with >5 
mismatches or with a sum of qualities of mismatches 
> 1 50. A second run of the same libraries generated 
3 1 bp forward and 3 1 bp reverse paired-end reads, and 
we used the same mapping parameters as the first 
sequencing experiment to map back to the reference 
sequence. We obtained an average of 2.4 million reads 
per pool, ranging from 0.25 to 4.6 million 
(Supplementary Figure S4). Percentage of reads with at 
least one mismatch averaged ~12.6% per pool (range: 
8.4-28%). Average number of mismatches per read is 
1.3 mismatches across all libraries. The average quality 
score per mismatch ranged from 11.4 to 24.5, with an 
average of 20.9 across all pools. We excluded reads with 
at least six mismatches or a sum of mismatches quality 
scores of >300 from later analyses, removing an average 
of 0.03% of reads from each pool. The filtered reads from 
both runs were combined for variant discovery, resulting 
in an average coverage per base position of 705X in each 
pool. Using bases with quality score of >Q30, coverage 
along the CRP region is shown in Supplementary 
Figure S5. 

Variant calling 

Positions along the CRP reference region were screened 
for rare variants independently. To determine whether a 
position was polymorphic in the first sequencing experi- 
ment, we used only reads in eligible pools. A pool was 
eligible if it contained at least 5000 reads covering that 
position with a base quality score of at least Q35. Using 
only bases with high quality lowered the probability of 
identifying a sequencing error incorrectly as a poly- 
morphic variant. Using pools with a base coverage of at 
least 5000X at that position ensures that, even if only 1 of 
192 chromosomes in the pool is polymorphic, we would 



Page 3 of 1 5 



Nucleic Acids Research, 2013, Vol. 41, No. 7 e85 



r Pooling 1 

Grouped by columns 

A 1 ->2 4 



12 

-11- 

10 

Pool 9" 

- Pool 

Pool 7 



Pool 7- 
Pool9- 
Pool 1 1- 
1- 



Pooling 2 
Grouped by rows 









































































































T 
















































































- 








































































1 














_L 






1 
































r 


Plate 1 






















1 










































































■ 














1 









































































































































































































































— ► 
















































































































































































































































































































































Plate 2 


















































































































































































































































































































1 







































































































































































1 
















































































































































































1 








































Plate 3 




















































































































1 
















































































































































































1 






1 


1 





























































Pooling 3 

Grouped by quadrants 



EHI 

mm 



!IIIII%*I«IIIII2KM 
iiiiii*9Miiiiiiy.KMiiiiii»*HiiiiiK»«n 
■■ SKU K*H Nil SS 

snm ksh ssm ss 
:"Mih.'::-:ii» 



m 



1 



□ Po< " 1 □ 



□ 



□ 



Figure 1. Pooling scheme for sequencing experiment 2. We selected 96 samples across three 384-well plates based on their position on the plate, such 
that each sample was present in exactly 3 of the 72 pools and majority of the samples can be uniquely identified based on the pattern of positive 
pools. For the first 24 pools, samples in the same columns were combined. For the next 24 pools, samples in the same rows were combined. Lastly, 
samples in the same well position across plates were combined into a total of 24 pools. This pooling scheme resulted in a total of 72 pools that can be 
sequenced in six lanes on Illumina Genome Analyzer IIx platform. 



expect to observe the rare allele >26 times. An eligible 
pool in the second sequencing experiment was defined as 
having >3000 reads with a base quality score of at least 
Q30 for a given position. A lower threshold for read depth 
was chosen because the read length for the second experi- 
ment was 31 bp, instead of 53 bp in the first experiment. 
Even at this read depth, we would expect to observe the 
rare allele >15 times in a pool if present. 

For each base, we considered the major allele to be the 
most commonly observed allele with highest frequencies in 
each pool across all eligible pools. The minor allele was 
called in two ways; first, the allele with the second highest 
average frequency across all eligible pools and second, the 
allele which was the second most common allele across 
majority of the available pools. In most circumstances, 
these two approaches resulted in the same minor allele; 
however, they gave different results when the variant was 
potentially tri-allelic (Supplementary Figure S6). We 
imposed a stringent criterion requiring the minor allele to 
have a frequency of at least 0.5%, which equals a single 
copy of rare allele in a pool of 96 individuals. In cases 
where minor allele calls differ between these two 
approaches, suggesting the base may be tri-allelic, both 
minor alleles were considered separately. We considered 
Fj be the frequency of the minor allele under consideration 
in the i' h pool. For each potential variant, we calculated the 
following metric, STAT: (X / Y) x Z, where X is the range 
of Fj over all pools, Y is the sum of pairwise absolute dif- 
ferences in Fj between pools and Z is the standard deviation 



of Fj over all pools. When STAT was greater than the 
threshold generated by the noise model (described in the 
next paragraph), the minor allele was deemed valid and 
hence the base position was called polymorphic. 

For any given base, we set F/ 2> to be the frequency of 
the second-most frequent allele in pool i. Given pools of 
96 individuals (192 chromosomes), the lower bound for 
Fi was 0.5% when there was only one rare allele 
present in the pool. Values of F/ 2) <0.5% were assumed 
to be due to experimental noise. Therefore, we constructed 
a noise model, F nots J 2 , from the distribution of F/ 2> 
across all positions from all pools where F- 2> < 0.5%. 
Fnoise was different by reference alleles, CG and AT 
(Figure 2). We modelled the distribution of noise separ- 
ately for CG and AT using two Gaussian models in R 
(function: nls). After the Gaussian models had been con- 
structed, we performed 10000 simulations based on the 
assumption that only one pool contained a variant 
among various numbers of total pools. In each iteration, 
the null model randomly drawn F/ 2> from the given 
Fnoise distribution for each pool i. For the variant 
model, F/ 2> for the pool containing the variant was set 
to 0.5% while F,- i^/ 2> was drawn from F nois J 2) . After 
each iteration, we calculated the metric STAT. For a given 
number of total pools, the threshold for STAT was 
defined as the 1% quantile of these 10000 iterations. 
The behaviour of the metric STAT was investigated by 
altering different values for the number of total available 
pools and the pre-determined minor allele frequencies. 
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Verification of variants 

To validate the rare variants we identified from sequence 
data, we used an allele-specific primer extension assay on 
the BeadXpress platform (VeraCode Golden Gate 
Genotyping kit, Illumina) to genotype 81 loci for all 
samples (Supplementary Table SI). Eight of the 81 loci 
were predicted to be noise by our algorithm and were 
included to assess the specificity. The predicted variants 
were selected first, followed by choosing presumably 
monomorphic sites that were technically compatible with 
the set of predicted variants. As a result, the predicted 
monomorphic sites were not sampled randomly over the 
distribution of distances between their ST A T values and 
the threshold values generated by the noise model 
(Supplementary Figures S7 and S8). We failed to 
genotype seven variants owing to incorrect primers. In 
addition, 11 variants exhibited genotype patterns that 
were consistent with both being monomorphic and experi- 
mental failures. Thus, we designed primers for these 18 
loci, 1 potentially tri-allelic locus, and 3 more loci that 
have been confirmed on BeadXpress to be sequenced 
using 454 Sanger sequencing. A small number of individ- 
uals were selected based on their pooling pattern for each 
variant. We inspected the traces to determine whether the 
Golden Gate assay failed or that the variants were incor- 
rectly predicted by our method. 

Assigning rare variants to individuals 

For each verified variant, pools with F (2> greater than 95th 
percentile of F noL J 2> were deemed to contain carriers. We 
composed a list of potential carriers from these pools for 
each variant. Since variants were verified either by 
genotyping all individuals using the Golden Gate assay 
on the BeadXpress platform (Illumina) or by genotyping 
a small subset of individuals using Sanger sequencing, we 
evaluated these two sets of variants separately. For 
variants verified using Golden Gate assay, we determined 
the percentages of individuals with copies of alternate 
allele in the list of individuals in positive pools. For 
variants verified using Sanger sequencing, we calculated 
the expected number of heterozygous individuals by multi- 
plying the average allele frequencies in positive pools by 
192 and compared that with the number of individuals we 
identified. 

Simulation of probability of unambiguously identifying 
carriers of rare variants 

We used simulation to estimate the probability that 
carriers of a rare allele could be unambiguously identified 
by the pattern of pools positive for the rare allele. 
Assuming a fixed total population size (AO and a fixed 
number of individuals in each pool (np), if each individual 
is present in exactly one pool per dimension, the total 
number of pools per dimension is P = N/np. In our test 
case, N = 2304 and np = 96, soP= 24. Define a singleton 
as an allele carried by exactly one person in the popula- 
tion. If a singleton is present, only one pool will be positive 
for the presence of the rare allele per dimension. To un- 
ambiguously assign the identity of a carrier, the number of 



patterns for positive pools must exceed the number of in- 
dividuals in the population. The number of possible 
singleton patterns is p D , where D is the number of dimen- 
sions in which the population was pooled. Singleton alleles 
can be unambiguously assigned to an individual in the 
population as long as p D >N, so for our example, three 
dimensions are adequate to unambiguously identify 
singleton carriers in a population of 2304 individuals 
because 24 3 > 2304. 

Unambiguously identifying carriers of rare alleles when 
more than one carrier is present is more challenging, but if 
the solution space of patterns is sparse enough, then it 
would be possible. The number of possible patterns in- 
creases as a function of the number of positive pools in 
each dimension, so we suspected that a modest number of 
additional dimensions would be adequate. Given the car- 
riers of a rare allele, the number of possible patterns is 

~l I (assuming no pools carrying more than one 

carrier in each dimension), but we found it theoretically 
challenging to directly calculate the likelihood that any 
doubleton pattern would resolve unambiguously as the 
union of only one pair of individuals present in the 
population. 

In the absence of a theoretical solution, we simulated 
data under N = 2304 individuals, np = 96 individuals per 
pool, P = 24 pools per dimension, D between three and five 
pooling dimensions and a between one and five rare allele 
carriers. In each pooling dimension, the pool assignments 
of individuals were made such that they were as orthogonal 
to previous dimensions as possible, producing a three, four 
or five digit pooling code for each individual. Then, for 
each value of a, we randomly drew a individuals from 
our population. The pooling codes of these a individuals 
were combined to create an observed pattern of positive 
pools, and then we searched to see whether we could find 
any other combination of a individuals who were different 
from the initial individuals but generated the same observed 
pattern. We repeated for 100000 iterations and calculated 
the average frequency with which the observed pattern was 
consistent with one and only one combination of an indi- 
vidual, yielding unambiguous identity for the carriers. We 
obtained the mean and standard deviation by repeating 20 
sets of these 100000 iterations. 



RESULTS 

High coverage sequencing of the CRP genomic region 

We aimed to discover new rare variants in the CRP region 
by first deeply sequencing this region in individuals in the 
CARDIA cohort. In each individual, we amplified the 
CRP region and then pooled the amplicons from 96 indi- 
viduals. The pooled PCR products were used to generate a 
barcoded library compatible with the Illumina Genome 
Analyzer (GA) IIx sequencer. We undertook two rounds 
of sequencing. In the first round, we identified putative 
polymorphisms from 24 of these pooled libraries contain- 
ing disjoint samples, but did not attempt to map it to 
individuals. In the second round, we strived to discover 
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new rare variants and identify individual carriers simul- 
taneously by re-assorting all individuals combinatorially 
into three sets of 24 pools for a total of 72 pools 
(Figure 1). Effectively, each sample was sequenced once 
in each set of pools and exactly three times across the 72 
pools. Thus, the singleton alleles carried by a single indi- 
vidual should be confirmed in three different pools in three 
separate lanes, and we would be able to identify that in- 
dividual based on the pattern of positive pools. We 
describe the quality metrics of the two rounds of 
sequencing in the METHODS. In both rounds, we 
filtered reads with low total Phred (basecall) qualities. 
The coverage per base in each pool averaged 900X in 
the first round, and 705X in the second round. 

A null model of variant allele frequency 

A base in a sequencing read can differ from the reference 
base due to experimental noises introduced at various 
stages rather than true polymorphisms. The Phred quality 
score measures the accuracy of sequencing PCR products, 
but the PCR products themselves could be synthesized with 
substitutions due to errors introduced earlier, for example, 
by DNA polymerases. Since we filtered reads based on their 
Phred qualities, we only removed sequencing errors and not 
errors introduced prior to sequencing. To differentiate a 
polymorphic base from these other experimental noises, 
we needed to first inspect the noise distribution in the 
sequencing experiment. 

We considered F/ 2 \ the ratio of reads of the 
second-most frequent allele to the total reads at a given 
base in pool i. A reasonable distribution for F, (2) across all 
positions is occasional large values, corresponding to 
polymorphic positions, superposed on background noise. 
If the DNA from the 96 diploid samples in a pool were 
amplified equally, the minimum value F/ 2) obtains for a 
true polymorphism would be 1/192, ~0.5%. We made this 
simplifying assumption and constructed the noise distri- 
bution, Fnohe , by truncating the distribution of F/ 2 \ 
i = 1,„24, at 0.5%. Although, F noi J 2> is non-negative, 
we found that a normal distribution fit reasonably. The 
distribution was also dependent on whether the reference 
base was C/G or A/T (Figure 2). The bases that form 
triple hydrogen bonds (guanine and cytosine) have a 
10-fold lower mean error. In the first sequencing experi- 
ment, the noise model for A/T reference base had a mean 
of 0.0021 (standard deviation = 0.0094), and the noise 
model for C/G reference base had a mean of 0.00038 
(standard deviation = 0.00019). For the second 
sequencing experiment, the noise model for A/T reference 
base had a mean of 0.0022 (standard deviation = 0.00089) 
and the noise model for C/G reference base had a mean of 
0.00039 (standard deviation of 0.00023). 

ST A T: Detecting signals for true polymorphisms 

We next constructed a metric to determine whether an 
alternative allele at a given position was likely poly- 
morphic or could be generated from the noise model. 
Let Fj (n) be the frequency of the n th most frequent 
allele in the i' h pool. For a given position, we calculate 



the following metric, 

STAT = (X/Y)*Z, 

where Xis the range of F/ !,) over all pools, Fis the sum of 
pairwise absolute differences in F/" J between pools and Z 
is the standard deviation of F/ n> over all pools. If no 
variant exists in any pool, the range and standard devi- 
ation of Fj (n> would be small, since F/ n) would only 
sample from the noise distribution, and the ratio of 
range to sum of pairwise absolute difference in F/ n) 
would also be small, making the value of STAT small 
(Figure 3A). If a rare variant exists in one of the pools, 
the range of F/ n) over all pools increases, resulting in an 
increase in the value of STAT (Figure 3B). To consider the 
case of multi-allelic SNPs, we calculated STAT for n = 2, 
3 and 4 for all positions in the sequenced region. Next we 
describe a way of finding a threshold for STAT to ap- 
proximately control the false discovery rate. 

Null distribution of ST A T and its sensitivity in detecting 
true polymorphisms 

We investigated the sensitivity of STAT to detect true 
polymorphisms in simulations with various numbers of 
total pools. For a given number of n total pools, we set 
one pool to contain one copy of the variant. In other 
words, the value of F <2> was set to be 0.5% in this 
positive pool. For the remaining n—1 pools, F'" 1 was 
drawn from the appropriate noise model. We then 
calculated STAT using F <2> from all pools and repeated 
for 10000 iterations to generate a distribution of STAT. 
We also generated a null distribution of ST A T assuming 
there was no variant in any of the pool. In this simulation, 
F <2> in all pools were drawn from the noise models. To be 
conservative, we took the first percentile from the distri- 
bution of ST^4rwith one positive pool and compared it 
with the median from the null distribution of STAT. We 
found that regardless of the number of total pools, the 
value of STAT calculated from one positive pool was 
always greater than when none of the pools contained a 
variant (Figure 4A). Thus, at boundary of detection (one 
polymorphic allele in the entire set of pools), we have 99% 
power. 

We also investigated the sensitivity of ST A T to detect 
true polymorphisms when F""' in the positive pool varies. 
We found that STAT was linearly related to the number of 
copies of variant in a given pool, regardless of the number 
of total pools (Figure 4B). These results suggested that 
STAT was a sensitive and robust metric to use in iden- 
tifying variants with frequencies as low as a single copy 
in a pool of 96 samples. We declared position and allele 
n to be a candidate variant if its STAT value exceeded 
the first percentile under the alternate distribution 
and F <2> in positive pools were >0.5% (Supplementary 
Table S2). 

Variant discovery and validation 

Assuming all samples in a pool were amplified equally, we 
used the first percentile ST A T values as a threshold and 
identified 130 putative variants in the 6kb CRP region. 
These 130 putative variants all have a minor allele 
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Figure 2. Distribution of noise by different reference allele [F m 



J. For any given position along the CRP reference region, we define Ff ' as the 
frequency of second-most frequent allele in pool i. Assuming all 96 samples in a pool were amplified equally, the minimum value of F/ 2> is 0.5%. 
Noise, F nois J 2> , is thus defined as any F/ 2 ' <0.5% across all pools and all positions. Distributions of noise differ with respect to the reference allele, 
regardless of the experiment (A: first sequencing experiment, and B: second sequencing experiment). 
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Figure 3. Examples of distributions of F* 21 across all pools. F/ 2> is defined as the ratio of reads of the second-most frequent allele to the total reads 
at a given base in pool i. For a given position, we constructed the metric STAT to determine whether an alternative allele at a given position was 
likely polymorphic. ST A T is defined as (X j Y) x Z, where X is the range of F/ 2> over all pools, Y is the sum of pairwise absolute differences in F- 2 ' 
between pools and Z is the standard deviation of F/ 2> over all pools. When there is no alternate variant at a position, the range and standard 
deviation of F/ 2> remain small even against a high background error rate, resulting in a small STAT value (A). If a rare variant exists, the value of 
STAT increases as the range of F/ 2> increases (B). 



frequency of at least 0.5%. In the scenario where samples 
in a pool were not amplified equally, the minor allele 
frequency of a single copy variant might not be >0.5%. 
To take into account this possibility, we declared any 
position to be a putative variant if its STAT value 



exceeded the first percentile under the alternative distribu- 
tion without requiring any minimum minor allele fre- 
quency. We obtained another eight candidates with 
minor allele frequencies ranging from 0.02 to 0.1%. 
Together, there were 138 putative variants. Of these 138, 
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Figure 4. Robustness of the STAT metric. We compared the values of STAT between two models across various numbers of total available pools: 
the null model assumed all pools contained noise, where minor allele frequencies were drawn from F nois J 2> , and the variant model assumed that only 
one pool contained the variant with a frequency of 0.5%. We performed 10000 simulations and compared the 1% quantile of STAT {torn the variant 
model to the mean of STAT from the null model (A). We also investigated the values of STAT across different minor allele frequencies in the 
positive pool (B). 



54 polymorphisms were previously documented in the 
dbSNP database (Table 1). Ninety of the 138 variants 
are located in the intergenic region flanking CRP; 2 are 
in the 5' UTR region (chrl: 157 950 900-157 951 003); 8 are 
in the intron (chrl: 157 950 553-157 950 838); 13 are in 
exon 2 of CRP (chrl: 157 949 939-157 950 552); and 25 
are in 3' UTR region (chrl: 157 948 704-157 949 938). 
Four putative variants were flagged as potentially 
tri-allelic. The majority of the variants discovered have a 



frequency <1%, indicating that rare polymorphisms are 
much more prevalent than common polymorphisms in 
CRP (Figure 5). 

To assess the predictive power of our algorithm, we 
genotyped 73 candidate variants (genotyping all 138 was 
not possible due to oligo restrictions) and eight loci pre- 
dicted to be monomorphic to verify using either the 
Illumina Golden Gate genotyping assay or the 454 
Sanger sequencing assay (Supplementary Table S3). 
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Table 1. All variants discovered in sequencing experiments 1 and 2 
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Fifty-four of the 73 candidates were verified to be poly- 
morphic by genotyping all CARDIA individuals using the 
Golden Gate assay. An additional 14 candidates were 
verified by Sanger sequencing in a subset of individuals. 
Of the eight loci which were deemed monomorphic by 
V-sieve, three were found to be polymorphic; two by 
Golden Gate assay and one by Sanger sequencing. Only 
one of three false negatives, rs 114985520, was reported in 
the dbSNP database (version 135). Overall, we verified 68 
of 73 putative variants, including two tri-allelic loci. This 
results in a positive predictive value of 93% (Table 2). 



In the second sequencing experiment, we identified 157 
variants across the 72 pools. We obtained another four 
candidate variants if we relaxed the minimum frequency 
requirement on the alternative alleles. The majority of 
these predicted variants were also discovered in the first 
sequencing experiment (73%) (Table 3 and Supplementary 
Figure S9). We used the same set of verified variants to 
compare the sensitivity of the second sequencing experi- 
ment to the first. In total, 65 of 69 predicted variants were 
confirmed, including two tri-allelic variants. We achieved 
a 94% positive predictive rate in this experiment. 
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Identifying individuals with new rare variants from 
sequencing experiment 

Conducting two sets of experiments to discover new rare 
variants and then to identify individuals carrying these 
variants is both time consuming and labour intensive. In 
our second sequencing experiment, we strived to accom- 
plish these two goals simultaneously and evaluated our 
ability in identifying individuals carrying the 63 verified 
variants. For each variant, pools with F (2> greater than 
the 95% F noi J 2> threshold were considered to contain 
carriers. Since each individual was sequenced three times 
in three different pools, most individuals in the second 
sequencing experiment had a unique pooling pattern 
that we can use to identify them. Based on the pattern 
of positive pools, we gathered a list of potential individ- 
uals carrying each variant. We cannot narrow down 16 of 
the 63 variants to specific individuals since they were too 
prevalent (found in 60% of pools and had an average 
frequency >1% across pools). For the other 47 variants, 




<0.05% 



0.05%to0.1% 0.1% to 0.5% 0.5%tol% 

Allele Frequency in a Fool of 96 Individuals 



Figure 5. Distribution of average frequencies in a pool of 96 individ- 
uals for variants discovered in the first sequencing experiment. 



Table 2. Verification of variant discovered in the first sequencing 
experiment 

Total 



GoldenGate Assay and 
Sanger sequencing 



First sequencing 
experiment 



Predicted 
variant 



Predicted 
noise 



True variant 

Noise 

Total 



5 
73 



71 
10 

81 



the lists of potential carriers included at least one con- 
firmed person with the variants. 

We further explored whether for a given variant, all 
confirmed individuals carrying the alternate allele were 
on the list of potential carriers, a proxy for the sensitivity. 
For 33 variants verified by Golden Gate assay, we con- 
sidered both the number of potential carriers (target size) 
and the percentage of confirmed individuals (success rate). 
The target size is an estimate of the number of capillary 
sequencing experiments that we would have performed if 
we did not use the Golden Gate assay to genotype all 
samples. We found that for 16 rare polymorphisms with 
frequencies <0.05% in a pool, we were able to identify all 
individuals carrying the alternate allele (Figure 6). The 
median target size was four, suggesting that we only 
needed to genotype four subjects per variant to identify 
all individuals carrying the alternate allele using capillary 
sequencing. For four variants with frequencies between 
0.05 and 0.1%, we were able to identify all confirmed in- 
dividuals, and the median target size was 56 people. For 
nine variants with frequencies between 0.1 and 0.5%, we 
were able to identify, on average, 67% of confirmed indi- 
viduals. The median target size was 99. For four variants 
with 0.5-1% frequency, we were able to identify 81% of 
confirmed individuals. The median pool size was 960, 
indicating that the target size grew nearly exponentially 
in the minor allele frequency. We performed a total of 
59 capillary sequencing experiments in a subset of individ- 
uals for 14 variants (Figure 7). On average, we genotyped 
4.2 subjects per variant. We calculated the expected 
number of individuals with a variant by multiplying the 
average allele frequencies in the positive pools by 192. For 
nine variants, we identified all individuals carrying that 
variant by capillary sequencing. In summary, our success 
in identifying all individuals with a specific variant 
decreased as a variant became more common. For rare 
variants with frequencies <0.5% (equivalent to having 
11 copies in our cohort), we were able to achieve high 
success rate in genotyping < 100 individuals per variant. 



DISCUSSION 

Common SNPs identified in GWAS to date generally 
account for small effects, leading to the hypothesis that 
rare variants might also play a functional role in 
determining phenotypes. Whole-genome sequencing of in- 
dividuals is not yet feasible due to the large cost. An al- 
ternative approach is to sequence a pool of multiple 
individuals, identify the polymorphisms present in the 
pooled sample, and then genotype the discovered 



Table 3. Overlap between verified variants in sequencing experiments 1 and 2 



GoldenGate Assay and 
Sanger sequencing 




Predicted variants 






Predicted noise 
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Experiment 
1 only 


Experiment 
2 only 


Both 
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2 only 
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71 
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Figure 6. Median size of pools containing potential individuals carrying variants and the median percentage of verified individuals with variants of 
different frequencies. 




polymorphisms in each individual to assign individual 
genotypes. A major challenge with a pooled variant-detec- 
tion approach is discriminating between background tech- 
nical noise and true rare polymorphisms. To address this, 
we have developed a detection algorithm, V-sieve, which is 
able to efficiently screen a candidate gene for idiosyncratic 
variants in a large population, with a positive predictive 



rate of 93%. We were able to identify polymorphisms with 
an allele frequency as low as 0.02%, corresponding to a 
singleton rare allele carried by one individual in our 
cohort of 2283. 

There are several major advantages with our method. 
First, the noise model was constructed directly from the 
experimental data, effectively normalizing for technical 
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noise specific to that experiment and requiring no add- 
itional experimental control. Second, even though we 
focused on novel rare variant discovery, we were equally 
successful in detecting common variants. Thirteen of the 
138 variants we identified had an allele frequency >5%, all 
of them known in dbSNP. Although we assumed equal 
contribution of each individual in each pool and set 
allele frequency of 0.5% (1 copy in 192 chromosomes) 
as a lower bound, the robustness of our method still 
enables us to discover variants with allele frequencies 
lower than that. We identified eight singleton SNPs with 
allele frequencies <0.5% in any specific pooled library, 
presumably due to imprecise pooling of the original 96 
samples in each pool. Four of these putative variants 
were genotyped in all individuals, and three were con- 
firmed to be real polymorphisms. Finally, our methods 
identified several tri-allelic polymorphisms. 

We included both sites predicted to be polymorphic and 
monomorphic by V-sieve for verification. The eight 
presumed monomorphic sites were chosen to be compatible 
with the predicted variants already in the set. Technical par- 
ameters such as primer compatibility defined by the Illumina 
BeadExpress platform and distances from predicted variant 
sites were considered. As a result, these eight sites were not 
sampled randomly over the distribution of distances between 
their ST A rvalues and the corresponding threshold values 
generated by the noise model (Supplementary Figure S8). 
They were in fact biased towards the ST A rvalues close to 
the threshold, making them far from a comprehensive set of 
loci to test for false negatives. It was thus not that surprising 
that three of them turned out to be true rare variants. Two 
of the three sites have minor allele frequencies <0.1%. We 
suspect that at such low minor allele frequency levels, the 
unequal contribution of each sample in the pools becomes a 
bigger factor in influencing whether a site is called variant by 
V-sieve. This suggests that for some experiments where iden- 
tifying as many rare variants is a higher priority than 
limiting the number of false positives, lowering the threshold 
values set by the noise models might help uncovering add- 
itional true variants. 

Sequencing pooled samples, followed by genotyping 
predicted variants in all individual samples, can be more 
cost-effective than sequencing all individuals in the target 
population. This approach, however, is labour-intensive 
and still costly due to the genotyping experiments using 
Golden Gate assay required to individually identify single- 
ton heterozygotes. Our second approach used a carefully 
designed strategy to pool samples such that most of 
samples can be uniquely identified by their presence in 
exactly 3 out of 72 pools of 96 individuals, enabling us 
to simultaneously screen for rare variants, and pinpoint 
individuals carrying them on the basis of which pools 
were positive for the rare allele. The efficiency of this 
approach depends on the accuracy of identifying 
carriers. Unambiguous determination of carrier identity 
is made possible by combinatorial pooling, where each 
individual is present in multiple dimensions (pools). In 
our test case, when only one rare allele carrier is present 
in the population, we can unambiguously identify carrier 
identity using three dimensions of parallel pooling, such 
that every individual is present in exactly three pools 
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Figure 8. Probability of unambiguously resolving carrier identities 
across different pooling schemes. We simulated and compared the 
probability of unambiguous detecting individuals carrying one to five 
copies of rare variant in our population using three to five parallel 
pooling strategies. On average, we can distinguish up to five copies 
of rare variant >50% of the time using just four parallel pools, 
where each individual appears four times in four separate pools in 
the experiment. If we use an additional pooling, we could determine 
individuals carrying quintuplets almost 100% of the time just from the 
sequencing data. 



(Figure 8). When two carriers are present in the popula- 
tion, with three dimensions we can unambiguously 
identify both carriers ~53% of the time. In other words, 
53% of the time, we do not need additional experiments to 
verify the carriers. As the number of copies of variant 
increases, the probability of unambiguously identifying 
carriers decreases. 

However, additional pooling dimensions can overcome 
these ambiguities. With four dimensions of parallel 
pooling, even when five carriers are present in the 
overall population we can unambiguously resolve the 
identities of all five carriers >75% of the time. This prob- 
ability increases to 98% with five dimensions of parallel 
pooling. The drawbacks of our multi-dimensional pooling 
strategy include the cost of additional sequencing libraries, 
and the increased amount of DNA required from each 
sample. Therefore, we suggest that the most comprehen- 
sive and efficient strategy for the identification of all 
variants in a population and the assignment of individual 
genotype to all individuals in the population, both in 
terms of cost and labour, may be to use the combinatorial 
pooling strategy to discover new variants and identify 
carriers of rare variants with up to five carriers in the 
population, with subsequent genotyping of more 
frequent polymorphism in all individuals, to unambigu- 
ously assign genotype at the more common variants. We 
believe this efficient algorithm will be useful in both 
screening genomic regions known to associate with par- 
ticular phenotypes from GWAS for rare variants, and in 
screening large populations for an excess burden of rare 
variation in candidate genes (14-17). 
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